ABSTRACT 

A special case of examinee choice, the Optional Essay 
Problem, is examined from the point of view of test equating. The 
Optional Essay Problem involves equating essay scores when the 
examinees are required to select an optional essay topic from a list 
of topics in addition to taking a mandatory test required of all 
examinees. The conditions that must be satisfied if the null 
hypothesis of equal difficulty of the essays holds true are derived. 
If this hypothesis, called "Livingston's Null Hypothesis," holds 
true, there is no need to equate the scores. The conditions take the 
form of inequalities about unobservable quantities that may be 
displayed graphically. They are illustrated with a real example from 
the Advanced Placement Examinations. S. A. Livingston's (1988) 
proposal of adjusting essay scores in the Optional Essay Problem is 
analyzed and explained from the perspective of test equating, and his 
proposal is generalized to two new proposals that are explicit about 
the assumptions they make concerning the unobserved data. These 
methods are illustrated, and the results for adjusting optional essay 
scores are used to propose comparable procedures for directly 
adjusting linear composite scores that include mandatory and optional 
test scores. Six tables and five figures present analysis data. 
(Contains 12 references.) (SLD) 

Abstract: We examine a special case of examinee choice, the Optional 
Essay Problem, from the point of view of test equating. The Optional Essay 
Problem involves equating essay scores when the examinees are required to 
select an optional essay topic from a list of topics in addition to taking a 
mandatory test required of all examinees. We derive conditions that must be 
satisfied if the null hypothesis of 'equal difficulty' of the essays is true. (We 
call this Livingston's Null Hypothesis.) If this hypothesis holds then there is 
no need to equate the scores on the optional essays. Our conditions take the 
form of inequalities about unobservable quantities that may be displayed 
graphically. We illustrate them using a real example from the Advanced 
Placement Examinations. Then we analyze Livingston's (1988) proposal for 
adjusting essay scores in the Optional Essay Problem and explain it from the 
perspective of test equating. We use our explanation to generalize his 
proposal to two new proposals that are explicit about the assumptions they 
make concerning the unobserved data. (We argue that every method for 
adjusting the essay scores in the Optional Essay Problem must make 
assumptions about unobserved data.) We illustrate the adjustment methods 
with an example from the Advanced Placement Examinations. Finally, we 
use the results for adjusting optional essay scores to propose comparable 
procedures for directly adjusting linear composite scores that include both a 
mandatory and an optional test score. 




1. INTRODUCTION 



Testing programs often have examinations that consist of both 
mandatory and optional parts. For example, many of the Advanced 
Placement Examinations have a multiple-choice portion that is required of 
all examinees and an additional set of essay topics from which each 
examinee must choose one or more on which to write. The 'optionality' of 
the essay is only in the topic chosen, not in whether or not to write the essay 
portion of the exam. One result of these choices is that examinees do not all 
take the same total test, and, more importantly, their own choices determine 
important features of which complete test they do take. Allowing examinee 
choice in such tests is often justified as a way of preventing examinees from 
having to work on a 'large' test item (like an essay) that they feel is 
inappropriate for them. Topics can be inappropriate for various reasons, 
such as a special curriculum used in a course taken by the examinee, or their 
academic concentration in the humanities versus the sciences. 

In the applications that we have in mind, the scores given to the 
optional portions of the test usually involve some subjective element; human 
graders evaluate essays or problem solutions and assign scores to them. In 
this type of grading, it may be difficult to apply exactly common grading 
standards across different problems or to essays written on different topics. 
In addition, it may be difficult to construct essay topics or problems that are 
of equal difficulty. These considerations, in turn, can lead to cases of 
unfairness to examinees that may undermine the good intentions that 
justified allowing examinee choice in the first place. 

For example, suppose that, in addition to the other mandatory parts of 
the test, examinees must select one essay topic from a set of five topics. 
Suppose also that topic 1 is inherently harder than the other topics or that the 
grading of topic 1 is more stringent than for the other topics. Examinees who 
had the misfortune to select topic 1 are possibly disadvantaged by their 
choice. Their scores are lower than they would have been had they chosen 
another topic. Can we separate the wisdom of an examinee's choice of topics 
from the effects of unintended differential difficulty or differential grading 
standards? This is the general problem of interest to us. 

Differential severity of essay or problem grading and differences in 
problem difficulty may not be obvious or intentional and this may make any 
attempt to regrade all the essays or problems with new grading standards 
unlikely to produce comparable results. When this occurs, statistical 
adjustments to the scores (score equating) may be required to achieve a fair 
test for all examinees. 

To give focus to the paper we will consider in detail a special case of 
examinee choice that we call the Optional Essay Problem. We remark that 
we do not limit the applicability of the Optional Essay Problem to cases 
where the optional tests are actual 'essays'. For example, they also might be 
math or science problems that the examinees must work out, showing all the 
steps in their reasoning, which are then graded by one or more problem 
readers. 

THE OPTIONAL ESSAY PROBLEM: Suppose that the complete test is 
made up of a mandatory part, with raw score denoted by X, and a single 
optional 'essay', and that each examinee must select a topic for the optional 
essay from a list of K topics. If an examinee chooses topic i, we denote the 
raw score on essay topic i by $Y_i$. 

The problem, then, is to equate the scores on the optional essay topics 
so that the examinees are not unfairly disadvantaged if there are differences 
in difficulty or in the severity of the grading across the topics. The only data 
available from a single examinee is a pair, $(X, Y_i)$, for some topic i that 
varies for each examinee. 

Our approach is to treat this as a test equating problem with missing 
data. The missing data are the scores on all the essays that the examinee did 
not select. If an examinee chooses to write on topic 1, then $Y_1$ is observed, 
but $Y_2, Y_3, \ldots, Y_K$ are all missing for that examinee. The way that we will 
decide on the need to equate the essay scores is to estimate what the 
marginal distributions of $Y_1, Y_2, \ldots, Y_K$ would be if each examinee had 
been assigned an essay topic at random, i.e., had exercised no choice of 
essay topic. If the resulting estimated distributions of some of the scores 
are notably different from the others, then equating may be necessary. We 
will use linear, observed-score equating methods in this paper because they 
involve only first and second moments of distributions and produce simple 
linear equating functions (Angoff, 1971; Holland and Rubin, 1982; Petersen, 
Kolen, and Hoover, 1989). However, the more general observed-score, 
'kernel equating' methods described by Holland and Thayer (1989) also fit 
into the scheme described here. We will summarize some simple facts about 
linear observed-score equating after we discuss Livingston's Null 
Hypothesis in section 2. 

A basic assumption of our approach is that it makes sense to consider 
the essay scores for the topics that an examinee did not select as missing data 
(i.e., data that could have been observed but were not). This is a subtle point 
because it assumes that each examinee could have selected a different essay 
topic from the one selected. This is an assumption about the strength of the 
determinants of that choice. There is an implicit 'similarity' between the K 
'essay topics' in our approach. This similarity might not be plausible in 
some instances of examinee choice, and our methods would not necessarily 
be applicable to such settings. For example, examinees usually select foreign 
language Achievement Tests on the basis of the languages that they studied 
in school (or have other familiarity with). The determinants of an examinee's 
selection to take a German rather than a French Achievement Test are very 
strong and it is often implausible to imagine such an examinee selecting a 
language exam for which they have no preparation. It may not be useful to 
regard this case of examinee choice as a problem of missing data. 

A useful criterion for testing whether or not our approach might apply 
to a given situation is to ask if it would be appropriate to assign the essay 
topics at random to the examinees instead of letting them exercise their own 
choice in the selection. When the choice between topics is relatively hard for 
examinees to make (i.e., the choices are not strongly determined) then 
random assignment might be appropriate, but when it is easy for examinees 
to choose between the options (i.e., the choices are strongly determined), 
random assignment is probably not appropriate. Wainer and Thissen (1993) 
use 'big choice' and 'little choice' to distinguish between choices that we 
have described as 'easy' or 'hard.' In practice, when there are several essay 
topics to choose from, an examinee will find it easy to eliminate some topics 
from consideration and hard to choose from the rest. 

When random assignment of topics is inappropriate, the choices the 
examinees make become an essential part of the test and this raises 
important and serious issues of score comparability that are not easily settled 
either by fiat or by psychometric means, e.g., see Wainer and Thissen 
(1993). 

Finally, we think it is important to remember that the question of 
whether or not to equate the scores on the optional essays arises only 
because examinees exercise choice in the selection of topics. If the topics 
had been assigned at random to the examinees then they would obviously 
have to be equated and standard random-group methods would be 
appropriate (Braun and Holland, 1982). One of the interesting things about 
the Optional Essay Problem is that examinee choice has the dual effect of (a) 
making the choice of an appropriate equating technique uncertain and (b) 
calling into question whether or not equating is necessary at all. 

The remainder of this paper is organized as follows. Section 2 
introduces notation and considers the situation in which it is unnecessary to 
equate the essay topics; we call this 'Livingston's Null Hypothesis.' In 
section 3 we derive inequalities that must be satisfied by the data if 
Livingston's Null Hypothesis is true, and in section 4 we illustrate these 
ideas with a real data example. In section 5 we show how consideration of 
the mandatory test score, X, can imply additional inequalities that must be 
satisfied by the data, and then illustrate these results of using X in section 6. 
Section 7 addresses what to do if we reject Livingston's Null Hypothesis, 
and decide to make score adjustments. We first analyze a proposal of 
Livingston's and then use our analysis to generalize Livingston's approach 
to achieve his goals. Section 8 illustrates our proposal and Livingston's 
proposal on real data. Section 9 generalizes the proposals of section 7 to 
procedures for adjusting composite scores directly rather than simply 
adjusting the essay topic scores and then using them interchangeably in a 
linear composite with X. Finally, section 10 makes a few additional points 
and summarizes the rest of the paper. 



2. LIVINGSTON'S NULL HYPOTHESIS 

Livingston (1988) suggests that evidence concerning the null 
hypothesis of 'equally difficult essay questions' may be used to specify the 
amount of 'correction' given to the essay scores in the Optional Essay 
Problem. 

The attempt to produce equally difficult questions may not succeed 
completely, but in the absence of any statistical information to the 
contrary it provides a reason for considering raw scores on the 
alternate questions to be comparable. (Livingston, 1988, page 3) 

His position is that if the topic selection, the scoring rubrics, the grading 
instructions and the training of the essay readers are all carefully 
implemented then the null hypothesis that the raw scores on the various 
topics are comparable may be valid and there may be no need to equate or 
otherwise adjust the essay scores, $Y_1, \ldots, Y_K$. Are there any data routinely 
collected in the Optional Essay Problem that can shed light on Livingston's 
Null Hypothesis? In order to answer this question we need to give more 
precision to its statement, which we shall do in defining $LH_0$, below. 

In observed-score test equating we compute or estimate the marginal 
distribution of each test score on a common population of examinees. 
Differences between these distributions are then used to devise adjustments 
to the scores--e.g., tests with higher mean scores are easier for the population 
and these scores are, therefore, adjusted downwards while tests with lower 
mean scores are harder for the population and these scores are adjusted 
upwards. If there are no differences among the score distributions then no 
adjustments are necessary since the tests are equally difficult for the 
population. 

We shall interpret Livingston's Null Hypothesis in these terms, but 
which population shall we use? In the Optional Essay Problem, examinees 
get to select which topic they will write on, and therefore the sub-population 
($P_i$) choosing essay topic i is non-random and subject to selection. In view 
of this, we shall use the whole population of examinees (all those taking the 
mandatory test X) as the population (P) on which to compute the observed-score 
equating function. We recognize that in the Optional Essay Problem 
self-selection is operating and that this will require us to make assumptions 
about the distribution of essay scores the examinees would have received if 
they had chosen to write on a topic different from the ones that they chose. 
Our approach is to see if these assumptions about selection are compatible 
with the observed data and Livingston's Null Hypothesis. Furthermore, 
because we are linearly equating the essay scores we will only concern 
ourselves with the first and second moments of distributions; we regard two 
distributions as the same if they have the same mean and variance. This 
leads to the following precise version of Livingston's Null Hypothesis. 

$$LH_0:\quad \mu = \mu_1 = \mu_2 = \cdots = \mu_K, \quad\text{and}\quad \sigma^2 = \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2,$$

where 

$$\mu_i = E(Y_i), \quad\text{and}\quad \sigma_i^2 = \mathrm{Var}(Y_i).$$

Thus, $\mu_i$ and $\sigma_i^2$ are the mean and variance of the essay score $Y_i$ over the 
whole population (or equivalently, in a large random sample not subject to 
selection). In our missing data approach to the Optional Essay Problem, we 
let $R_i$ be the 0/1 indicator for the examinee's choice of essay topic, i.e., 

$R_i = 1$ if the examinee chooses topic i, 
$R_i = 0$ otherwise. 

In the language of Little and Rubin (1987), the $R_i$ are the missing data 
indicator variables. The mean and variance of $Y_i$ for the examinees who 
chose topic i are then defined and denoted by 

$$\mu_{i1} = E(Y_i \mid R_i = 1), \quad\text{and}\quad \sigma_{i1}^2 = \mathrm{Var}(Y_i \mid R_i = 1).$$

The quantities $\mu_{i1}$ and $\sigma_{i1}^2$ are estimated by the sample mean and variance 
for the examinees who chose topic i, and in general they are not equal to the 
population mean and variance, $\mu_i$ and $\sigma_i^2$, that are referred to in Livingston's 
Null Hypothesis, $LH_0$. Along with $\mu_{i1}$ and $\sigma_{i1}^2$, there are the corresponding 
quantities for the examinees who did not choose topic i, i.e., for whom $R_i = 0$, 

$$\mu_{i0} = E(Y_i \mid R_i = 0), \quad\text{and}\quad \sigma_{i0}^2 = \mathrm{Var}(Y_i \mid R_i = 0).$$

Finally, let 

$$p_i = \mathrm{Prob}\{R_i = 1\},$$

the probability of selecting topic i, and 

$$q_i = 1 - p_i = \mathrm{Prob}\{R_i = 0\},$$

the probability of selecting a topic other than topic i. 

LINEAR OBSERVED-SCORE EQUATING: One reason for stating $LH_0$ in 
terms of the means and variances is that these quantities play a central role in 
the simplest of all test equating methods: linear observed-score equating. All 
observed-score equating methods take place on a specific population of 
examinees (Braun and Holland, 1982). In our problem, this will be the 
population of all examinees taking the test X, population P. The test scores 
to be equated, say $Y_i$ and $Y_j$, however, are not observed on P but on the 
self-selected sub-populations ($P_i$ and $P_j$) for which $R_i = 1$ and $R_j = 1$, 
respectively. Hence, the relevant means and variances needed to equate $Y_i$ to 
$Y_j$ are not $\mu_{i1}$, $\mu_{j1}$, $\sigma_{i1}^2$ and $\sigma_{j1}^2$, which are estimated from the 
self-selected, observed data for $Y_i$ and $Y_j$. Instead, the relevant means and 
variances are $\mu_i$, $\mu_j$, $\sigma_i^2$ and $\sigma_j^2$. 

The linear equating function for equating $Y_i$ to $Y_j$ on the population 
taking X is given by the following well-known linear equating formula: 

$$Y_j(y_i) = \mu_j + (\sigma_j / \sigma_i)(y_i - \mu_i), \tag{1}$$

where $y_i$ denotes a possible $Y_i$-score and $Y_j(y_i)$ denotes the transformation 
of this score to the scale of $Y_j$. This transformation of $Y_i$-scores will produce 
scores with the same mean and variance over P as $Y_j$ has. We note that if 
$LH_0$ is true then the transformation defined in (1) is simply the identity 
transformation, indicating that no adjustment is necessary to equate the 
scores of $Y_i$ to $Y_j$. 
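
As a minimal sketch of formula (1), the function below maps a score on one essay topic to the scale of another. The function name `linear_equate` and all numerical values are hypothetical and are not taken from the paper's tables; in practice the population moments would themselves have to be estimated under assumptions about the missing data.

```python
def linear_equate(y_i, mu_i, sigma_i, mu_j, sigma_j):
    """Transform a Y_i score to the Y_j scale using equation (1)."""
    if sigma_i <= 0 or sigma_j <= 0:
        raise ValueError("standard deviations must be positive")
    return mu_j + (sigma_j / sigma_i) * (y_i - mu_i)


# Hypothetical population moments (illustrative only, not from Table 1).
harder_topic = {"mu": 4.0, "sigma": 2.0}
easier_topic = {"mu": 5.0, "sigma": 2.5}

# A score of 6 on the harder topic, expressed on the easier topic's scale.
adjusted = linear_equate(6.0, harder_topic["mu"], harder_topic["sigma"],
                         easier_topic["mu"], easier_topic["sigma"])

# When the two topics share the same mean and variance (as under LH_0),
# the transformation reduces to the identity.
unchanged = linear_equate(6.0, 4.0, 2.0, 4.0, 2.0)
```

With these illustrative moments the score of 6 is moved upward, while the second call returns the score unchanged.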

An essential feature of this approach is that it makes explicit the fact 
that the equating of $Y_i$ to $Y_j$ will involve estimates of $\mu_i$, $\mu_j$, $\sigma_i$ and $\sigma_j$, and 
that these in turn will involve making assumptions about the missing data. In 
our opinion, it is impossible to equate the essays in the Optional Essay 
Problem without making some sort of assumptions, tacit or explicit, about 
the missing data. Later we will examine Livingston's (1988) 'ad hoc' 
procedure for adjusting essay scores in the Optional Essay Problem with this 
in mind. 



3. AN INEQUALITY FOR $\mu$ AND $\sigma^2$ 

Theorem 1 summarizes a relationship that must hold between $\mu$, $\sigma^2$ 
and the various quantities defined above when Livingston's Hypothesis 
holds. 

Theorem 1: Under $LH_0$, we have 

$$\sigma^2 = \sigma_{i1}^2 p_i + \sigma_{i0}^2 q_i + (p_i/q_i)(\mu_{i1} - \mu)^2. \tag{2}$$

The proof of Theorem 1 is a straightforward but tedious calculation 
based on computing $\mu_i$ and $\sigma_i^2$ by conditioning on the two values of $R_i$, and 
then replacing $\mu_i$ and $\sigma_i^2$ by $\mu$ and $\sigma^2$. Equation (2) expresses the common 
values, $\mu$ and $\sigma^2$, assumed in $LH_0$, in terms of quantities that can be 
estimated by the observed data, and the unknown variance, $\sigma_{i0}^2$, which 
cannot be so estimated. 
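
The decomposition behind equation (2) can be checked numerically. The sketch below is an illustration only: it uses a small made-up population in which the scores of the examinees who did not choose topic i are pretended to be known, which is never the case with real data. The helper name `variance_from_equation_2` is hypothetical.

```python
from statistics import fmean, pvariance


def variance_from_equation_2(mu_i1, var_i1, var_i0, p_i, mu):
    """Right-hand side of equation (2)."""
    q_i = 1.0 - p_i
    return var_i1 * p_i + var_i0 * q_i + (p_i / q_i) * (mu_i1 - mu) ** 2


# Hypothetical essay scores on topic i for the whole population.
# In practice only the "choosers" scores are observed.
choosers = [4.0, 5.0, 6.0, 7.0]            # R_i = 1
others = [2.0, 3.0, 3.0, 4.0, 5.0, 7.0]    # R_i = 0 (unobservable)
everyone = choosers + others

p_i = len(choosers) / len(everyone)

# Population variance computed directly ...
direct_variance = pvariance(everyone)

# ... and via the conditional pieces that appear in equation (2).
via_equation_2 = variance_from_equation_2(
    mu_i1=fmean(choosers),
    var_i1=pvariance(choosers),
    var_i0=pvariance(others),
    p_i=p_i,
    mu=fmean(everyone),
)
```

The two values agree up to floating-point error, which is what the conditioning argument in the proof predicts.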

Our approach to testing Livingston's Hypothesis is to make 
assumptions about $\sigma_{i0}^2$ in terms of its relation to $\sigma_{i1}^2$ and to see what 
implications these assumptions have for $\mu$ and $\sigma^2$ via equation (2). We 
define $A_i$ as the ratio 

$$A_i = \sigma_{i0} / \sigma_{i1}, \tag{3}$$

so 

$$\sigma_{i0}^2 = (A_i)^2 \sigma_{i1}^2. \tag{4}$$

Next we exploit the phenomenon that test score variances are usually 
quite similar across different sub-populations even though the means of the 
test scores may vary widely. For example see Table 2 from Holland and 
Wainer (1990). This can be expressed by inequalities of the form 



$$A_L < A_i < A_U, \tag{5}$$

where $A_L$ and $A_U$ are a priori bounds. Examples of plausible values for $A_L$ 
and $A_U$ might be $A_L = .90$ and $A_U = 1.10$, but other values for these bounds 
on the $A_i$ may be useful too. We note in passing that a Bayesian analysis in 
which the $A_i$ are assumed to have a continuous distribution centered on 1 
may be developed, but we have not pursued this approach here. 

We may combine the inequalities in (5) with the formula given in (2) 
to obtain conditions that the common mean and variance, $\mu$ and $\sigma^2$, must 
satisfy if (5) and $LH_0$ are true for the specified a priori bounds $A_L$ and $A_U$. 
These inequalities are given in Theorem 2. 

Theorem 2: If $LH_0$ is true and if the inequalities in (5) are satisfied then $\mu$ 
and $\sigma^2$ must satisfy these two inequalities for each essay topic (i = 1 to K), 

(a) $$\sigma^2 < \sigma_{i1}^2 [p_i + (A_U)^2 q_i] + (p_i/q_i)(\mu_{i1} - \mu)^2 \tag{6}$$

and 

(b) $$\sigma^2 > \sigma_{i1}^2 [p_i + (A_L)^2 q_i] + (p_i/q_i)(\mu_{i1} - \mu)^2. \tag{7}$$

Statements (6) and (7) define a U-shaped region sandwiched between 
two parallel parabolas in the $(\mu, \sigma^2)$-plane, with $\mu$ along the horizontal axis 
and $\sigma^2$ along the vertical axis. The two parabolas are defined by the 
quadratic equations formed from (6) and (7) by replacing the inequalities 
with equalities. Figure 1 shows the two parallel parabolas for the data for 
essay topic 2 from Table 1 introduced in section 4, below. In Figure 1, 
$A_L = .90$ and $A_U = 1.10$. 

(Insert Figure 1 about here) 



For any essay topic i, these two parabolas share a common vertical 
line of symmetry, at $\mu = \mu_{i1}$, and differ only in the height of their minima. 
The inequalities (6) and (7) together require the possible values for the 
common mean and variance in $LH_0$ to lie in this U-shaped region. However, 
this must hold for each i = 1 to K, so the region of $(\mu, \sigma^2)$-values that are 
consistent both with the data and the inequalities in (5) is the intersection of 
K such U-shaped regions in the $(\mu, \sigma^2)$-plane; see Figure 2 in section 4, 
below. Depending on the values of $A_L$ and $A_U$, this intersection may or may 
not be empty. If it is empty, then this version of Livingston's Hypothesis is 
not consistent with the data and the assumptions about the ratios of the 
variances in (3) and (5). When this region is not empty it specifies all of the 
pairs of values of $(\mu, \sigma^2)$ that are consistent with $LH_0$, the data and our a 
priori assumptions. 
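
The emptiness check described above can be sketched in code. This is an illustrative grid search under stated assumptions, not the graphical procedure used in the paper: the dictionary keys (`p`, `mean`, `var`), the function names, and all numbers are hypothetical, and a coarse grid can miss a narrow feasible interval.

```python
def parabola_bounds(mu, topic, a_low, a_up):
    """Lower and upper bounds on sigma^2 from (7) and (6) at a given mu."""
    p = topic["p"]
    q = 1.0 - p
    shift = (p / q) * (topic["mean"] - mu) ** 2
    lower = topic["var"] * (p + a_low ** 2 * q) + shift
    upper = topic["var"] * (p + a_up ** 2 * q) + shift
    return lower, upper


def feasible_mu_values(topics, a_low, a_up, mu_grid):
    """Grid values of mu at which all K U-shaped regions overlap."""
    feasible = []
    for mu in mu_grid:
        bounds = [parabola_bounds(mu, t, a_low, a_up) for t in topics]
        highest_lower = max(lo for lo, _ in bounds)
        lowest_upper = min(up for _, up in bounds)
        if highest_lower < lowest_upper:
            feasible.append(mu)
    return feasible


mu_grid = [i / 10 for i in range(0, 101)]

# Hypothetical observed moments for the examinees who chose each topic.
similar_topics = [
    {"p": 0.5, "mean": 5.0, "var": 4.0},
    {"p": 0.5, "mean": 5.0, "var": 4.0},
]
dissimilar_topics = [
    {"p": 0.5, "mean": 5.0, "var": 4.0},
    {"p": 0.5, "mean": 5.0, "var": 1.0},
]

compatible = feasible_mu_values(similar_topics, 0.90, 1.10, mu_grid)
incompatible = feasible_mu_values(dissimilar_topics, 0.90, 1.10, mu_grid)
```

In the second made-up case the observed variances differ by more than the bounds of .90 and 1.10 can absorb, so no value of $\mu$ on the grid is feasible.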


4. EXAMPLE 1 

Table 1 gives the means, variances and covariances for an example of 
the Optional Essay Problem, the 1987 administration of the Advanced 
Placement Examination for European History. In this example, the optional 
essay topics are topics 2 through 7 while topic 1 is required of all 
examinees. In the example, topic 1 is ignored. 

(Insert Table 1 about here) 

Figure 2 shows the 6 pairs of parabolas for $A_L = .90$ and $A_U = 1.10$. 
Their region of intersection is non-empty and is shaded in Figure 1. Figure 3 
shows the 6 pairs of parabolas for $A_L = .95$ and $A_U = 1.05$. The region of 
intersection for Figure 3 is empty. Thus, in this example, $LH_0$ is consistent 
with the data and ratios of standard deviations between 90 and 110 percent, 
whereas it is not consistent with narrower limits on these ratios, i.e., between 
95 and 105 percent. 

(Insert Figures 2 and 3 about here) 



5. BRINGING X INTO THE PICTURE 

Information from the mandatory part of the test may provide 
information about the relative 'difficulty' of the essay topics. Table 2 
displays Livingston's (1988) example of a 'reversal' in the order of the 
means of the optional essay scores from the order of the means of the 
mandatory multiple-choice test. 

(Insert Table 2 about here) 

The point of Livingston's example is that the mean of the multiple- 
choice score for the group selecting essay 6 (41.5) is the lowest mean of the 
five groups, but the mean score on essay 6 for those same examinees (7.1) is 
the highest of the five groups. This is an extreme example of a 'reversal' of 
the essay and the multiple-choice score means. (There is also an example of 
a much smaller reversal in Table 1, involving essays 2 and 4.) There is 
usually a positive correlation between test scores, so this reversal may 
suggest that the grading of essay topic 6 is unduly easy relative to the 
grading of the other essay topics. 

Notice that the evidence here involves X, the test score that all 
examinees have. The inequalities developed in section 3 do not involve X. 
Our goal now is to derive additional inequalities in the spirit of those of 
section 3, but which do involve X. 

The first step is to reexamine the form of Livingston's Null 
Hypothesis in $LH_0$. In the Advanced Placement example of the Optional 
Essay Problem, the multiple-choice score X is not simply another variable 
that is obtained from each examinee, but rather it is combined with the essay 
score to produce a final linear composite raw score that is then used to form 
reported scores. Hence, consider the linear composite score 

$$S_i = X + wY_i, \tag{8}$$

where w > 0 is the relative 'weight' given to the essay part of the composite. 
The composite score $S_i$ is subscripted with an i because it is based on X and 
essay i. Note that w does not depend on which essay the examinee selects. 
Our idea is to replace $LH_0$ with an equivalent hypothesis about the mean and 
variance of the composite scores. Again, the distribution of the composites is 
taken over all of the examinees, not just those selecting essay i. The resulting 
generalized version of Livingston's Null Hypothesis can be stated initially 
as: 

$$E(S_1) = E(S_2) = \cdots = E(S_K), \quad\text{and} \tag{9}$$

$$\mathrm{Var}(S_1) = \mathrm{Var}(S_2) = \cdots = \mathrm{Var}(S_K). \tag{10}$$

However, (9) and (10) can be re-expressed in terms of other more 
basic quantities. First of all, we see that (9) is equivalent to the first part of 
$LH_0$ because 

$$E(S_i) = \mu_X + w\mu_i, \tag{11}$$

where $\mu_X$ is the mean of X over all of the examinees, and w is non-zero. 
Secondly, we have 

$$\mathrm{Var}(S_i) = \sigma_X^2 + w^2\sigma_i^2 + 2w\sigma_{XY_i}, \tag{12}$$

where $\sigma_X^2$ is the variance of X over all of the examinees and $\sigma_{XY_i}$ is the 
covariance of X and $Y_i$ over all of the examinees. When $i \neq j$, 

$$\mathrm{Var}(S_i) = \mathrm{Var}(S_j) \tag{13}$$

if and only if 

$$(w/2)[\sigma_i^2 - \sigma_j^2] = \sigma_{XY_j} - \sigma_{XY_i}. \tag{14}$$

We want (14) to hold for all $i \neq j$ and for any value of w that we may choose. 
The weight, w, is usually determined from considerations external to the 
question of equating the essay scores and so we require that (14) holds no 
matter what the choice of w is. This will happen if and only if 

$$\sigma_i^2 = \sigma_j^2 \quad\text{and}\quad \sigma_{XY_i} = \sigma_{XY_j}, \tag{15}$$

for all $i \neq j$. 
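
A small numerical sketch may help show why the requirement must hold for every weight w. The moments below are made up so that condition (14) holds at w = 2 even though the essay variances and covariances differ; at any other weight the composite variances then disagree. The function name `composite_variance` is hypothetical, and the formula used is the standard variance of a sum, $\mathrm{Var}(X + wY_i) = \sigma_X^2 + w^2\sigma_i^2 + 2w\sigma_{XY_i}$.

```python
def composite_variance(var_x, var_y, cov_xy, w):
    """Variance of the composite S = X + w * Y."""
    return var_x + w ** 2 * var_y + 2.0 * w * cov_xy


# Hypothetical population moments for two essay topics i and j.
var_x = 100.0
essay_i = {"var": 4.0, "cov": 1.0}
essay_j = {"var": 3.0, "cov": 2.0}

# At w = 2 condition (14) happens to hold: (2/2) * (4 - 3) == 2 - 1.
equal_at_w2 = (
    composite_variance(var_x, essay_i["var"], essay_i["cov"], 2.0)
    == composite_variance(var_x, essay_j["var"], essay_j["cov"], 2.0)
)

# At w = 1 the same moments give different composite variances.
var_i_w1 = composite_variance(var_x, essay_i["var"], essay_i["cov"], 1.0)
var_j_w1 = composite_variance(var_x, essay_j["var"], essay_j["cov"], 1.0)
```

Because the equality at w = 2 is an accident of the chosen moments, it does not survive a change of weight, which is why both parts of (15) are needed.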

Combining these results we obtain the Generalized Livingston 
Hypothesis that is parallel to $LH_0$. 

Theorem 3: If (9) and (10) are to hold for any choice of the weight w then 
(9) and (10) may be re-expressed as: 

$$GLH_0:\quad \mu = \mu_1 = \mu_2 = \cdots = \mu_K, \tag{16}$$

$$\sigma^2 = \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2, \quad\text{and} \tag{17}$$

$$c = \sigma_{XY_1} = \sigma_{XY_2} = \cdots = \sigma_{XY_K}. \tag{18}$$

The version of the Generalized Livingston Hypothesis expressed in 
(16)-(18) also can be motivated in other ways, for example, by adding to 
$LH_0$ the additional requirement that all the $Y_i$ correlate equally with X over 
the entire population of examinees. 

Next we give the results that parallel Theorems 1 and 2 for the 
covariance, c, in (18). 

Theorem 4: Under $GLH_0$ we have 

$$c = \sigma_{XY_i 1}\, p_i + \sigma_{XY_i 0}\, q_i + (p_i/q_i)(\mu_{Xi1} - \mu_X)(\mu_{i1} - \mu), \tag{19}$$

where 

$$\sigma_{XY_i 1} = \mathrm{Cov}(X, Y_i \mid R_i = 1), \quad \sigma_{XY_i 0} = \mathrm{Cov}(X, Y_i \mid R_i = 0),$$

$$\mu_{Xi1} = E(X \mid R_i = 1), \quad\text{and}\quad \mu_X = E(X).$$

The proof of Theorem 4 parallels that of Theorem 1. 

When the correlations between X and $Y_i$ are positive, the natural 
bounds for points in the $(\mu, c)$-plane are obtained from inequalities for the 
ratios of the correlation of X and $Y_i$ for $R_i = 1$ and $R_i = 0$ as well as the 
previous bounds assumed on the ratio of the variance of $Y_i$ when $R_i = 1$ and 
$R_i = 0$. Let $B_i$ be defined by 

$$B_i = \rho_{XY_i 0} / \rho_{XY_i 1}, \tag{20}$$

where $\rho_{XY_i 0}$ is the correlation from the covariance $\mathrm{Cov}(X, Y_i \mid R_i = 0)$, etc. 
Just as for the $A_i$, it may be plausible to assume that a priori bounds for $B_i$ 
exist, i.e., 

$$B_L < B_i < B_U \tag{21}$$

for some values of $B_L$ and $B_U$, such as $B_L = .90$ and $B_U = 1.10$. The next 
theorem summarizes the resulting two inequalities for c and $\mu$. 

Theorem 5: If $GLH_0$ is true and the inequalities (5) and (21) hold then c and 
$\mu$ satisfy the following two inequalities for all i = 1 to K. 

(a) $$c < \sigma_{XY_i 1}\,[p_i + q_i (\sigma_{Xi0}/\sigma_{Xi1}) A_U B_U] + (p_i/q_i)(\mu_{Xi1} - \mu_X)(\mu_{i1} - \mu) \tag{22}$$

and 

(b) $$c > \sigma_{XY_i 1}\,[p_i + q_i (\sigma_{Xi0}/\sigma_{Xi1}) A_L B_L] + (p_i/q_i)(\mu_{Xi1} - \mu_X)(\mu_{i1} - \mu), \tag{23}$$

where 

$$\sigma_{Xi1}^2 = \mathrm{Var}(X \mid R_i = 1), \quad \sigma_{Xi0}^2 = \mathrm{Var}(X \mid R_i = 0).$$

Thus, $\sigma_{Xi1}^2$ is the variance of X in the group who choose essay topic i and 
$\sigma_{Xi0}^2$ is the variance of X for the rest of the examinees. 

Inequalities (22) and (23) define a region in the $(\mu, c)$-plane ($\mu$ 
horizontal and c vertical) that is a strip lying between two parallel lines, both 
with slope 

$$-(p_i/q_i)(\mu_{Xi1} - \mu_X). \tag{24}$$

Theorem 5 says that the region of $(\mu, c)$-values that are compatible with 
$GLH_0$, the data and the a priori bounds $A_L$, $A_U$, $B_L$, $B_U$ is the intersection 
of these K strips. It is possible, for particular choices of $\sigma_{XY_i 1}$, $\sigma_{Xi1}$, $\sigma_{Xi0}$, 
$A_L$, $A_U$, $B_L$, $B_U$, $\mu_{Xi1}$, $\mu_X$ and $\mu_{i1}$, that no pair of values of $(\mu, c)$ can satisfy 
all of the inequalities for i = 1, . . . , K. Hence, by bringing the mandatory 
portion of the test into the picture we double the number of inequalities that 
must be satisfied. This can make it even harder for $GLH_0$ to be acceptable in 
light of the data and our a priori assumptions expressed by the bounds $A_L$, 
$A_U$, $B_L$, $B_U$. 
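
The strip defined by (22) and (23) for a single topic can be sketched as follows. The dictionary keys, function names, and every numerical value are hypothetical illustrations rather than values from Tables 1 or 2, and the sketch assumes the observed covariance for the choosers is positive, as in the text.

```python
def strip_bounds(mu, topic, a_low, a_up, b_low, b_up):
    """Lower and upper bounds on c from (23) and (22) at a given mu."""
    p = topic["p"]
    q = 1.0 - p
    sd_ratio = topic["sd_x0"] / topic["sd_x1"]
    line = (p / q) * (topic["mean_x1"] - topic["mean_x"]) * (topic["mean_y1"] - mu)
    lower = topic["cov1"] * (p + q * sd_ratio * a_low * b_low) + line
    upper = topic["cov1"] * (p + q * sd_ratio * a_up * b_up) + line
    return lower, upper


def strip_slope(topic):
    """Common slope of the two boundary lines, expression (24)."""
    p = topic["p"]
    q = 1.0 - p
    return -(p / q) * (topic["mean_x1"] - topic["mean_x"])


# Hypothetical topic chosen by a group that scored below average on X.
topic = {
    "p": 0.2,          # proportion choosing the topic
    "mean_x1": 40.0,   # mean of X among choosers
    "mean_x": 45.0,    # mean of X over all examinees
    "mean_y1": 7.0,    # mean essay score among choosers
    "cov1": 6.0,       # Cov(X, Y_i | R_i = 1), assumed positive
    "sd_x1": 10.0,     # SD of X among choosers
    "sd_x0": 10.0,     # SD of X among non-choosers
}

slope = strip_slope(topic)
lower_at_5, upper_at_5 = strip_bounds(5.0, topic, 0.90, 1.10, 0.90, 1.10)
lower_at_6, _ = strip_bounds(6.0, topic, 0.90, 1.10, 0.90, 1.10)
```

Because the choosers' mean on X is below the overall mean in this made-up case, the strip slopes upward. Intersecting K such strips could be done with the same kind of grid search over $\mu$ used for any set of paired bounds.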

In Livingston's (1988) analysis he considers the case when the 
correlations are zero. Our use of ratios of correlations will not work in that 
case. However, equation (19) is valid even if $\rho_{XY_i 1} = 0$, and inequalities 
similar to (22) and (23) can be developed for this case. The region of 
possible values for $(\mu, c)$ under $GLH_0$ will then be the intersection of 
several strips in the $(\mu, c)$-plane that depend on the data, the a priori bounds 
$A_L$ and $A_U$, and our a priori assumptions on the possible sizes of the 
correlations $\rho_{XY_i 0}$. We do not see how the size of $\rho_{XY_i 1}$ can tell us much 
about the plausibility of either $GLH_0$ or $LH_0$, contrary to the position taken 
by Livingston. In practice we expect $\rho_{XY_i 1}$ to be modest and positive and so 
we will not pursue the case of $\rho_{XY_i 1} = 0$ further. 

Even though we cannot illustrate the following point with the data in 
Table 2, inequalities (22) and (23) do allow us to see how 'reversals' like the 
one in Table 2 might lead to violations of Livingston's Null Hypothesis. The 
slope (24) of the parallel lines in the $(\mu, c)$-plane specified by (22) and (23) 
is positive when the mean of the mandatory section scores for examinees 
who chose essay topic i is less than the overall mean of the scores on the 
mandatory section, and is negative when it is greater than the overall mean. 
In addition, the larger $\mu_{i1}$ is, the more to the right these lines are shifted. 
When we have a 'reversal,' as in Table 2, a pair of lines far to the right 
(large value for $\mu_{i1}$) have large positive slopes rather than the expected 
negative slopes. This can easily make the intersection region empty so that 
there are no $(\mu, c)$-values that are compatible with Livingston's Null 
Hypothesis. 



6. EXAMPLE 2 

We return to the data in Table 1. The relationship 

$$\sigma_X^2 = p_i \sigma_{Xi1}^2 + q_i \sigma_{Xi0}^2 + (p_i/q_i)(\mu_{Xi1} - \mu_X)^2 \tag{25}$$

allows us to compute $\sigma_{Xi0}$ from the items in Table 1 for use in (22) and (23). 
All other values needed are presented in Table 1. Figure 4 shows the 
resulting pairs of lines for the limits $A_L = .90$ and $A_U = 1.10$, and $B_L = .90$ 
and $B_U = 1.10$. Figure 5 shows the resulting pairs of lines for the narrower 
limits $A_L = .95$ and $A_U = 1.05$, and $B_L = .95$ and $B_U = 1.05$. 

(Insert Figures 4 and 5 about here) 

In both Figures 4 and 5 the intersection of the six regions is non-empty, 
so that there are combinations of $\mu$ and c that are compatible with the 
data and the restrictions (5) and (21). 

If we adopt the range $A_L = B_L = .90$ and $A_U = B_U = 1.10$, then 
Livingston's Hypothesis (expressed either as $GLH_0$ or $LH_0$) is compatible 
with the data in Table 1, and we may conclude that because we cannot reject 
it there is no need to equate or otherwise adjust the essay scores. For those 
who think that equating is desirable in this example the onus is to provide 
evidence that the above bounds on the $A_i$ and $B_i$ are too big. The data in 
Table 1 cannot provide such evidence. Moreover, no data routinely 
collected in the Optional Essay Problem can provide it. However, through 
comparisons with other tests and data collected in special experiments (e.g., 
Wang, Wainer and Thissen, 1993), it may be possible to build up useful 
prior knowledge for the a priori choices of $A_L$, $B_L$, $A_U$ and $B_U$. For 
example, the 50 observed standard deviations of SAT (V+M)-scores by State 
given in Table 2 of Holland and Wainer (1990) range from .87 to 1.11 times 
their mean value of 201. The standard deviations of X and the $Y_i$ in our 
Table 1 also show a similar small range of values. While these data do not 
give direct evidence about the $A_i$, they do support the intuition that the 
variation of standard deviations of test scores across populations is often 
small, and that the choices of $A_L = .90$ and $A_U = 1.10$ do not give an 
unduly narrow range. 

From Figure 2 the resulting range of possible $\mu$-values is roughly
between 6 and 8, and from Figure 4 it is roughly between 6 and 9. Thus,
$\mu < 6$ is not compatible with these data and with assumptions (5) and (21).
We note that the average essay scores for each row in Table 1 all exceed 6
except for essay topic 7, for which the mean is 5.9, slightly below 6.
Therefore, the examinees who chose essay 7 are not from the top of the
distribution of scores for that essay. According to this analysis, examinees
who would have received high scores on essay 7 actually chose other essay
topics to write on, and probably got high scores on them. Examinees who
did choose topic 7 were from the low end of the score distribution and
probably would have received low scores on any essay topic. The question
of whether or not the examinees who chose topic 7 would have done better
to have chosen a different essay topic cannot be answered from the analysis
or data presented here.
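
As a worked illustration of how relation (25) is used in Example 2, the
following sketch solves (25) for $\sigma_{Xi0}$ using the rounded entries of
Table 1. The function and variable names are ours (hypothetical), the
identity $q_i = 1 - p_i$ is assumed, and because the inputs are rounded the
outputs are only approximate.

```python
import math

# Rounded Table 1 entries: topic -> (mu_Xi1, sigma_Xi1, p_i).
TABLE1_X = {
    2: (51.9, 16.3, 0.32),
    3: (57.7, 15.8, 0.13),
    4: (53.3, 16.6, 0.16),
    5: (52.0, 15.9, 0.10),
    6: (51.7, 16.1, 0.11),
    7: (42.7, 16.9, 0.18),
}
MU_X, SIGMA_X = 51.3, 16.9  # "Total" row of Table 1


def sigma_x_not_chosen(mu_x1, sigma_x1, p, mu_x=MU_X, sigma_x=SIGMA_X):
    """Solve (25) for sigma_Xi0, the SD of X among examinees not choosing topic i."""
    q = 1.0 - p  # assumption: q_i = 1 - p_i
    var0 = (sigma_x**2 - p * sigma_x1**2 - (p / q) * (mu_x1 - mu_x) ** 2) / q
    return math.sqrt(var0)


sigma_x0 = {topic: sigma_x_not_chosen(*row) for topic, row in TABLE1_X.items()}
for topic, value in sigma_x0.items():
    print(f"topic {topic}: sigma_Xi0 is approximately {value:.2f}")
```

Substituting each result back into the right-hand side of (25) recovers
$\sigma_X^2$, which is a quick check that the rearrangement is consistent.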



7. WHAT IF WE REJECT LIVINGSTON'S NULL HYPOTHESIS? 



In this section we first discuss Livingston's 'ad hoc' proposal for
equating the essays in the Optional Essay Problem, and we then make
alternative proposals that are in the same spirit as Livingston's but which
are more simply motivated, in our opinion. Our approach involves interpreting
various equations given in Livingston (1988) in terms of observed-score test
equating. Livingston does not use this interpretation to describe his formulas.

In order to have a simple way of referring to the several populations of
examinees that arise in the analysis we remind the reader that P is the entire
population of examinees who take X and write on some essay topic from the
list of K topics and that $P_i$ is the sub-population of P that writes on
topic i.

LIVINGSTON'S 'AD HOC' ADJUSTMENT: Livingston's procedure is 
fairly complicated, so we break it down into three steps. 

Step 1. Equate $Y_i$ to each of the other $Y_j$ and, for examinees in $P_i$,
obtain the converted value of the observed $y_i$ to the scales of the other
$Y_j$'s. Call these converted values $Y_{ij}^*(y_i)$. Livingston uses a
special version of equating that we will discuss momentarily.






Step 2. Obtain 'imputed' values, $y_{j,\mathrm{imputed}}(y_i)$, for
$j \neq i$ for each examinee in $P_i$. These imputed values are weighted
averages of the observed value $y_i$ and its equated value in the $Y_j$
scale, $Y_{ij}^*(y_i)$, of the form:

$$y_{j,\mathrm{imputed}}(y_i) = (1 - \rho_{XY_j1})\, y_i + \rho_{XY_j1}\, Y_{ij}^*(y_i). \tag{26}$$



Step 3. Compute the adjusted essay score as the simple average of the
1 observed and the K - 1 imputed essay scores for each examinee:

$$y_{\mathrm{adj}} = \Big[ y_i + \sum_{j \neq i} y_{j,\mathrm{imputed}}(y_i) \Big] / K. \tag{27}$$

If we define $Y_{ii}^*(y_i)$ as

$$Y_{ii}^*(y_i) = y_i \tag{28}$$

then we may combine the effect of steps 2 and 3 into a simple expression for
$y_{\mathrm{adj}}$, as follows.

Let $\bar{\rho}$ denote the average of all the correlations, $\rho_{XY_j1}$:

$$\bar{\rho} = \Big[ \sum_j \rho_{XY_j1} \Big] / K, \tag{29}$$

and let $\bar{Y}_i(y_i)$ be the following weighted average of the converted
values:

$$\bar{Y}_i(y_i) = \Big[ \sum_j \rho_{XY_j1}\, Y_{ij}^*(y_i) \Big] \Big/ \Big[ \sum_j \rho_{XY_j1} \Big]. \tag{30}$$

We may think of $\bar{Y}_i(y_i)$ as a transformation of $y_i$ into an 'average
scale' of the K essay scores determined by the equatings done in step 1 with
weights proportional to the correlations, $\rho_{XY_j1}$. Livingston's final
adjusted essay scores, $y_{\mathrm{adj}}$, can be expressed in this notation as

$$y_{\mathrm{adj}} = (1 - \bar{\rho})\, y_i + \bar{\rho}\, \bar{Y}_i(y_i). \tag{31}$$

The important feature of Livingston's proposal, in our opinion, is its
form expressed in (31) rather than the particulars of its definition. The raw,
unadjusted value, $y_i$, is averaged with the average converted value,
$\bar{Y}_i(y_i)$. The weight used in the averaging in (31) reflects
Livingston's degree of belief in the relative importance of the converted
scores and the original unadjusted score. (Livingston is quite explicit that
he regards the evidence for making an adjustment to the essay score $Y_i$ as
the greatest when $\rho_{XY_i1}$ is 1.0, and the least when this correlation
is zero.)
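
The collapse of steps 2 and 3 into the single expression (31) can be checked
numerically. The sketch below is ours, not Livingston's code: the function
names are hypothetical, the correlations are the six rounded values reported
for these data in section 8, and the converted scores are made-up numbers used
only to exercise the algebra. The closed form divides by the sum of the
correlations, so it assumes that sum is not zero.

```python
def adjusted_stepwise(y_i, i, rho, converted):
    """Steps 2 and 3: impute every other topic via (26), then average as in (27)."""
    K = len(rho)
    imputed = [
        (1 - rho[j]) * y_i + rho[j] * converted[j]  # equation (26)
        for j in range(K)
        if j != i
    ]
    return (y_i + sum(imputed)) / K  # equation (27)


def adjusted_closed_form(y_i, i, rho, converted):
    """The same adjusted score written in the form (31)."""
    conv = list(converted)
    conv[i] = y_i  # equation (28): Y_ii*(y_i) = y_i
    rho_bar = sum(rho) / len(rho)  # equation (29)
    y_bar = sum(r * c for r, c in zip(rho, conv)) / sum(rho)  # equation (30)
    return (1 - rho_bar) * y_i + rho_bar * y_bar  # equation (31)


rho = [0.43, 0.48, 0.39, 0.46, 0.49, 0.37]  # rounded correlations, topics 2 to 7
converted = [9.2, 9.0, 8.1, 9.4, 8.7, 8.9]  # hypothetical Y_ij*(y_i) values
y_i, i = 9.0, 1  # an examinee at list position 1 (topic 3) with raw score 9

stepwise = adjusted_stepwise(y_i, i, rho, converted)
closed = adjusted_closed_form(y_i, i, rho, converted)
print(stepwise, closed)
```

The two functions agree up to floating-point error for any inputs, which is
the content of the algebra leading from (26) and (27) to (31).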

We now turn to the equatings referred to in step 1. The method that
Livingston proposes for finding the converted scores that we denote by
$Y_{ij}^*(y_i)$ may be interpreted as an example of 'equating to a common
test' (Angoff, 1971), or 'equating through another test' (Braun and Holland,
1982), and is often referred to as 'chain equating' by ETS test statisticians
(although in chain equating the operational method used is usually
equipercentile rather than linear equating). The idea is to linearly equate
$Y_i$ to X on $P_i$, then to equate X to $Y_j$ on $P_j$, and finally to
compose or 'chain together' these two equatings to get a linear
transformation from the $Y_i$-scale to the X-scale to the $Y_j$-scale. The
first linear equating results in the function:

$$X_i(y_i) = \mu_{Xi1} + (\sigma_{Xi1} / \sigma_{i1})(y_i - \mu_{i1}). \tag{32}$$

The subscript i on $X_i(\cdot)$ indicates that the equating is on $P_i$. The
second equating results in the function:

$$Y_j(x) = \mu_{j1} + (\sigma_{j1} / \sigma_{Xj1})(x - \mu_{Xj1}), \tag{33}$$

where $\mu_{Xj1}$ and $\sigma_{Xj1}$ are the mean and standard deviation of X
for the examinees selecting topic j, defined earlier in Theorems 4 and 5.
When the two functions, $Y_j(x)$ and $X_i(y_i)$, are composed or 'chained
together' we get

$$Y_{ij}^*(y_i) = Y_j(X_i(y_i)) = \mu_{j1} + (\sigma_{j1} / \sigma_{Xj1})(\mu_{Xi1} - \mu_{Xj1}) + (\sigma_{Xi1} / \sigma_{Xj1})(\sigma_{j1} / \sigma_{i1})(y_i - \mu_{i1}), \tag{34}$$

which is formula (7) of Livingston (1988), in our notation.
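
The chaining in (32) to (34) can be sketched in a few lines. This is an
illustration under our own naming (the helper names are hypothetical), using
the rounded subgroup means and standard deviations from Table 1, so results
are approximate.

```python
# Rounded Table 1 entries: topic -> (mu_i1, sigma_i1, mu_Xi1, sigma_Xi1).
TABLE1 = {
    2: (7.5, 2.5, 51.9, 16.3),
    3: (8.4, 2.3, 57.7, 15.8),
    4: (6.5, 2.6, 53.3, 16.6),
    5: (7.4, 2.5, 52.0, 15.9),
    6: (6.7, 2.4, 51.7, 16.1),
    7: (5.9, 2.5, 42.7, 16.9),
}


def linear_equate(score, mean_from, sd_from, mean_to, sd_to):
    """Linear equating: match means and standard deviations of two scales."""
    return mean_to + (sd_to / sd_from) * (score - mean_from)


def chain_equate(y, i, j, table=TABLE1):
    """Convert a topic-i essay score to the topic-j scale through X, as in (34)."""
    mu_i, sd_i, mu_xi, sd_xi = table[i]
    mu_j, sd_j, mu_xj, sd_xj = table[j]
    x = linear_equate(y, mu_i, sd_i, mu_xi, sd_xi)  # equation (32), on P_i
    return linear_equate(x, mu_xj, sd_xj, mu_j, sd_j)  # equation (33), on P_j


# The mean topic-3 score (8.4) expressed on the topic-7 scale.
print(round(chain_equate(8.4, 3, 7), 2))
```

Chaining a topic to itself returns the score unchanged, which matches the
convention (28).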



LIVINGSTON'S MISSING DATA ASSUMPTIONS: A key theoretical
requirement of test equating is that the resulting equating function should not
depend on the population on which it is computed. This gives us a tool for
identifying the assumptions about the missing data that Livingston's
proposed procedure implicitly makes. Braun and Holland (1982, p. 37)
point out that in order for chain equating to give unbiased results the two
equating functions that are chained together (i.e., (32) and (33)) should not
depend on which population is used for the equating. In the present case this
means that equating $Y_i$ to X on $P_i$ ought to give the same equating
function as equating $Y_i$ to X on $P_j$, where, in fact, $Y_i$ is missing
data. If we were able to compute the linear equating function of $Y_i$ to X
on $P_j$, the result would be

$$X_{ij}(y_i) = \mu_{Xj1} + (\sigma_{Xj1} / \sigma_{ij1})(y_i - \mu_{ij1}), \tag{35}$$

where

$$\mu_{ij1} = E(Y_i \mid R_j = 1), \quad \text{and} \quad \sigma_{ij1}^2 = \mathrm{Var}(Y_i \mid R_j = 1). \tag{36}$$

The only way two linear functions can be identical is for their slopes and
intercepts to be the same. We conclude that the implicit assumptions made in
Livingston's proposal are that the missing data, $Y_i$ when $R_j = 1$ and
$i \neq j$, satisfy these conditions:

$$\sigma_{Xj1} / \sigma_{ij1} = \sigma_{Xi1} / \sigma_{i1}, \quad \text{or} \quad \mathrm{Var}(Y_i \mid R_j = 1) = \sigma_{i1}^2 (\sigma_{Xj1}^2 / \sigma_{Xi1}^2) \tag{37}$$

and

$$\mu_{Xj1} - (\sigma_{Xj1} / \sigma_{ij1})\, \mu_{ij1} = \mu_{Xi1} - (\sigma_{Xi1} / \sigma_{i1})\, \mu_{i1}, \quad \text{or} \quad E(Y_i \mid R_j = 1) = \mu_{i1} + (\sigma_{i1} / \sigma_{Xi1})(\mu_{Xj1} - \mu_{Xi1}). \tag{38}$$

We may use (37) and (38) to find estimates of $\mu_i$ and $\sigma_i^2$ that are
consistent with Livingston's assumptions about the missing data. They are
summarized in Theorem 6.

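Theorem 6. If $E(Y_i \mid R_j = 1)$ and $\mathrm{Var}(Y_i \mid R_j = 1)$ are
given by (38) and (37), respectively, for all i and j, then

$$\mu_i = E(Y_i) = \mu_{i1} + (\sigma_{i1} / \sigma_{Xi1})(\mu_X - \mu_{Xi1}) \tag{39}$$

and

$$\sigma_i^2 = \mathrm{Var}(Y_i) = (\sigma_{i1}^2 / \sigma_{Xi1}^2)\, \sigma_X^2. \tag{40}$$

The proof of the theorem follows from multiplying both sides of (37)
and (38) by $p_j$, summing over all j and interpreting the results; for the
variance, the spread of the conditional means in (38) across topics is added
to the sum obtained from (37), as in the usual decomposition of a variance
into within-group and between-group parts. These results mean that (a) the
mean, $\mu_i$, stands in the same relation to $\mu_{i1}$ as the mean, $\mu_X$,
does to $\mu_{Xi1}$, in terms of the standard deviations, $\sigma_{i1}$ and
$\sigma_{Xi1}$, respectively, and (b) the ratio of $\sigma_i$ to $\sigma_{i1}$
is the same as the ratio of $\sigma_X$ to $\sigma_{Xi1}$.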
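
To make Theorem 6 concrete, the sketch below evaluates (39) and the square
root of (40) for each topic from the rounded entries of Table 1. The function
name is hypothetical. Because Table 1 is itself rounded, the results can
differ in the last digit from values computed on unrounded data.

```python
# Rounded Table 1 entries: topic -> (mu_i1, sigma_i1, mu_Xi1, sigma_Xi1).
TABLE1 = {
    2: (7.5, 2.5, 51.9, 16.3),
    3: (8.4, 2.3, 57.7, 15.8),
    4: (6.5, 2.6, 53.3, 16.6),
    5: (7.4, 2.5, 52.0, 15.9),
    6: (6.7, 2.4, 51.7, 16.1),
    7: (5.9, 2.5, 42.7, 16.9),
}
MU_X, SIGMA_X = 51.3, 16.9  # "Total" row of Table 1


def chain_moments(mu_i1, sd_i1, mu_xi1, sd_xi1, mu_x=MU_X, sd_x=SIGMA_X):
    """Mean and SD of Y_i over the whole population under Theorem 6."""
    mu_i = mu_i1 + (sd_i1 / sd_xi1) * (mu_x - mu_xi1)  # equation (39)
    sd_i = (sd_i1 / sd_xi1) * sd_x  # square root of equation (40)
    return mu_i, sd_i


chain = {topic: chain_moments(*row) for topic, row in TABLE1.items()}
for topic, (mu_i, sd_i) in chain.items():
    print(f"topic {topic}: mu_i = {mu_i:.1f}, sigma_i = {sd_i:.1f}")
```

For topic 7, whose subgroup scored well below average on X, the estimated
population mean rises from the observed 5.9 to about 7.2, which is the kind of
upward correction the chain-equating assumptions imply.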

AN ALTERNATIVE PROCEDURE: Consider an adjusted essay score of
the form

$$a_i = (1 - W_i)\, y_i + W_i\, C_i(y_i), \tag{41}$$

where $C_i(y_i)$ is a transformation of $y_i$ to a common scale, and $W_i$ is
a weight. We will discuss $C_i(\cdot)$ and $W_i$ in turn.

In Livingston's procedure $C_i(\cdot)$ is $\bar{Y}_i(\cdot)$ defined by (30)
and (34). The scale to which this choice of $C_i(\cdot)$ maps $y_i$ is a
weighted average of the scales of all the essays, using the $\rho_{XY_j1}$ as
the weights. The reason for using a weighted average of these scales is to
avoid the arbitrary choice of the scale of one of the essay topics as the
scale for the other essay topics. The resulting 'average' scale is somewhat
unfamiliar. To avoid this we propose using the scale determined by the mean
and variance of the total pool of raw essay scores. Let

$$\mu_Y = \sum_i \mu_{i1}\, p_i \quad \text{and} \quad \sigma_Y^2 = \sum_i \sigma_{i1}^2\, p_i + \sum_i (\mu_{i1} - \mu_Y)^2\, p_i; \tag{42}$$

then $\mu_Y$ and $\sigma_Y^2$ are the mean and variance of the entire set of
essay scores ignoring that they come from different topics. If we obtain
estimates of $\mu_i$ and $\sigma_i^2$ by making some particular assumptions
about the missing data then the transformation

$$C_i(y_i) = \mu_Y + (\sigma_Y / \sigma_i)(y_i - \mu_i) \tag{43}$$

will map the scale of $Y_i$ to the scale of the raw essay scores with mean
$\mu_Y$ and variance $\sigma_Y^2$.
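
The pooled moments in (42) and the conversion (43) are easy to compute from
subgroup summaries. The following sketch uses the rounded Table 1 entries;
the function names are ours and hypothetical.

```python
import math

# Rounded Table 1 entries: topic -> (mu_i1, sigma_i1, p_i).
TABLE1_ESSAY = {
    2: (7.5, 2.5, 0.32),
    3: (8.4, 2.3, 0.13),
    4: (6.5, 2.6, 0.16),
    5: (7.4, 2.5, 0.10),
    6: (6.7, 2.4, 0.11),
    7: (5.9, 2.5, 0.18),
}


def pooled_essay_moments(table):
    """Mean and SD of all raw essay scores, ignoring topic, as in (42)."""
    mu_y = sum(p * mu for mu, _, p in table.values())
    var_y = sum(p * (sd**2 + (mu - mu_y) ** 2) for mu, sd, p in table.values())
    return mu_y, math.sqrt(var_y)


def to_pooled_scale(y, mu_i, sd_i, mu_y, sd_y):
    """Transformation (43): map a topic-i score to the pooled raw-score scale."""
    return mu_y + (sd_y / sd_i) * (y - mu_i)


mu_y, sd_y = pooled_essay_moments(TABLE1_ESSAY)
print(round(mu_y, 1), round(sd_y, 1))
```

With these inputs the pooled mean and standard deviation round to 7.1 and
2.6, the values in the "Total" row of Table 1.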

The transformation given in (43) can be used with any set of
assumptions about the missing data that lead to estimates of $\mu_i$ and
$\sigma_i^2$. In particular, using (39) and (40) we can obtain a version of
(43) that makes use of Livingston's assumptions about the missing data. There
are, however, other alternatives to the chain equating used by Livingston.
The most well-known is linear, anchor-test equating in which X is used as the
anchor test rather than as an intermediate test to which $Y_i$ and $Y_j$ are
both equated (Angoff, 1971, 1982; Braun and Holland, 1982; Petersen, Kolen and
Hoover, 1989). This method makes explicit assumptions about the missing
data that are different from those made by Livingston. These assumptions
are related to 'ignorable non-response' (Little and Rubin, 1987).

Consider the conditional distribution of $Y_i$ given X and the missing
data indicator variable, $R_i$, i.e.,

$$\mathrm{Prob}\{ Y_i = y \mid X = x \text{ and } R_i = r \} \tag{44}$$

where r = 0 or 1. If the probabilities in (44) do not depend on r, then the
missing data for $Y_i$ is said to be ignorable given X. A consequence of
ignorability is that the regression function

$$E(Y_i \mid X = x \text{ and } R_i = r) \tag{45}$$

and the variance function

$$\mathrm{Var}(Y_i \mid X = x \text{ and } R_i = r) \tag{46}$$

do not depend on r. If, in addition, we make the further modeling
assumptions that the regression function is linear and the variance function
is constant we obtain the two basic assumptions of linear anchor-test
equating (also known as Tucker equating):

$$E(Y_i \mid X = x \text{ and } R_i = 0) = E(Y_i \mid X = x \text{ and } R_i = 1) = \mu_{i1} + (\sigma_{i1} / \sigma_{Xi1})\, \rho_{XY_i1}\, (x - \mu_{Xi1}) \tag{47}$$

and

$$\mathrm{Var}(Y_i \mid X = x \text{ and } R_i = 0) = \mathrm{Var}(Y_i \mid X = x \text{ and } R_i = 1) = \sigma_{i1}^2 (1 - \rho_{XY_i1}^2). \tag{48}$$

Theorem 7 summarizes the resulting expressions for $\mu_i$ and $\sigma_i^2$
that follow from (47) and (48).

Theorem 7: If (47) and (48) hold then $\mu_i$ and $\sigma_i^2$ are given by:

$$\text{(a)} \quad \mu_i = (1 - \rho_{XY_i1})\, \mu_{i1} + \rho_{XY_i1} \big[ \mu_{i1} + (\sigma_{i1} / \sigma_{Xi1})(\mu_X - \mu_{Xi1}) \big], \tag{49}$$

$$\text{(b)} \quad \sigma_i^2 = (1 - \rho_{XY_i1}^2)\, \sigma_{i1}^2 + \rho_{XY_i1}^2\, (\sigma_{i1}^2 / \sigma_{Xi1}^2)\, \sigma_X^2. \tag{50}$$

We have written (a) and (b) in Theorem 7 in ways that emphasize that
under an ignorable (given X) missing data mechanism, $\mu_i$ and $\sigma_i^2$
are weighted averages (using $\rho_{XY_i1}$ and $\rho_{XY_i1}^2$ as weights)
of (a) the estimates of $\mu_i$ and $\sigma_i^2$ that are implicit in
Livingston's procedure (equations 39 and 40), and (b) the raw mean and
variance, $\mu_{i1}$ and $\sigma_{i1}^2$. Livingston's objection to using
linear anchor-test equating as the sole basis for adjusting the essay scores
is that when $\rho_{XY_i1} = 0$ this approach will result in assuming that

$$\mu_i = \mu_{i1} \quad \text{and} \quad \sigma_i^2 = \sigma_{i1}^2, \tag{51}$$

which will cause a large adjustment to the essay scores when the differences
between the $\mu_{i1}$ are large; when the $\rho_{XY_i1} = 0$ this is not
desirable, in Livingston's opinion.
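
The weighted-average structure of Theorem 7 can be seen directly in a small
computation. The sketch below (our own hypothetical function name) derives
$\rho_{XY_i1}$ from the covariance column of Table 1 and then evaluates (49)
and (50). Inputs are the rounded Table 1 entries, so the results are
approximate and may differ in the last digit from values based on unrounded
data.

```python
import math

# Rounded Table 1: topic -> (mu_i1, sigma_i1, mu_Xi1, sigma_Xi1, sigma_XYi1).
TABLE1 = {
    2: (7.5, 2.5, 51.9, 16.3, 17.6),
    3: (8.4, 2.3, 57.7, 15.8, 17.4),
    4: (6.5, 2.6, 53.3, 16.6, 16.7),
    5: (7.4, 2.5, 52.0, 15.9, 18.1),
    6: (6.7, 2.4, 51.7, 16.1, 19.0),
    7: (5.9, 2.5, 42.7, 16.9, 15.8),
}


def anchor_test_moments(mu_i1, sd_i1, mu_xi1, sd_xi1, cov, mu_x=51.3, sd_x=16.9):
    """Return (rho, mu_i, sigma_i) under the assumptions of Theorem 7."""
    rho = cov / (sd_i1 * sd_xi1)  # correlation of X and Y_i on P_i
    mu_chain = mu_i1 + (sd_i1 / sd_xi1) * (mu_x - mu_xi1)  # equation (39)
    var_chain = (sd_i1 / sd_xi1) ** 2 * sd_x**2  # equation (40)
    mu_i = (1 - rho) * mu_i1 + rho * mu_chain  # equation (49)
    var_i = (1 - rho**2) * sd_i1**2 + rho**2 * var_chain  # equation (50)
    return rho, mu_i, math.sqrt(var_i)


anchor = {topic: anchor_test_moments(*row) for topic, row in TABLE1.items()}
for topic, (rho, mu_i, sd_i) in anchor.items():
    print(f"topic {topic}: rho = {rho:.2f}, mu_i = {mu_i:.2f}, sigma_i = {sd_i:.2f}")
```

Setting the covariance to zero returns the raw subgroup mean and standard
deviation, which is the situation described in (51).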

One way to avoid this is to use an adjusted essay score of the form
(41) with

$$W_i = \rho_{XY_i1}, \tag{52}$$

rather than Livingston's choice of

$$W_i = \bar{\rho}. \tag{53}$$

Our alternative to Livingston's proposal uses an adjusted essay score
of the form (41), using (52) as the weight on the converted score and using
(43) to define the converted score. In our view the natural choices for
$\mu_i$ and $\sigma_i$ in (43) come from the assumption of ignorable
non-response (i.e., Theorem 7) rather than Livingston's proposal to equate
$Y_i$ to $Y_j$ through X. However, our approach requires the user to make an
explicit assumption about the missing data (i.e., choose an equating method),
so using either Theorem 6 or 7 or some other estimates of $\mu_i$ and
$\sigma_i$ is compatible with our approach. The result is an adjusted essay
score of the form:

$$a_i = (1 - \rho_{XY_i1})\, y_i + \rho_{XY_i1} \big[ \mu_Y + (\sigma_Y / \sigma_i)(y_i - \mu_i) \big], \tag{54}$$

where $\mu_i$ and $\sigma_i$ are defined in Theorem 7 (if linear anchor-test
equating is used to equate $Y_i$ to $Y_j$) or in Theorem 6 (if chain equating
is used to equate $Y_i$ to $Y_j$).

Our proposal has many of the features of Livingston's proposal given
in equation (31). In particular, it tends to dampen the amount of the
adjustment that is made to the scores on topic i by the size of the correlation
between X and $Y_i$: the smaller the correlation, the smaller the adjustment.
It is also a simple linear function of the essay score alone, i.e., two
examinees with identical $Y_i$-scores but different X-scores will get exactly
the same adjusted $Y_i$-score. It differs from Livingston's in that different
assumptions about the missing data, i.e., Theorems 6 or 7, can be used to
compute the converted scores, $C_i(y_i)$, from (43). Finally, the scale of the
adjusted essay scores in (54) is closely related to the original unadjusted
essay scores through $C_i(\cdot)$.
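
Formula (54) is simple enough to evaluate by hand, and a short sketch makes
the dampening visible. The function name below is hypothetical. The example
inputs for topic 3 are rounded values taken from this paper: the correlation
.48 quoted in section 8, the anchor-test estimates 8.0 and 2.3 from Table 3,
and the pooled mean and standard deviation 7.1 and 2.6 from the "Total" row of
Table 1.

```python
def adjusted_essay_score(y, rho, mu_i, sd_i, mu_y, sd_y):
    """Adjusted essay score (54): a weighted average of y and its converted value."""
    converted = mu_y + (sd_y / sd_i) * (y - mu_i)  # transformation (43)
    return (1 - rho) * y + rho * converted  # equation (54), weight (52)


# Topic 3 under the anchor-test assumptions, rounded inputs.
low = adjusted_essay_score(1, 0.48, 8.0, 2.3, 7.1, 2.6)
high = adjusted_essay_score(15, 0.48, 8.0, 2.3, 7.1, 2.6)
print(round(low, 1), round(high, 1))
```

With these rounded inputs a raw score of 1 is pulled down to roughly 0.1
while a raw score of 15 is essentially unchanged. A weight of zero returns
the raw score, and so does any topic whose estimated mean and standard
deviation already equal the pooled values.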

8. EXAMPLE 3. 

Table 3 gives, for the AP European History data in Table 1, the values
of $\mu_i$ and $\sigma_i$ under the two different sets of assumptions about
the missing data given in Theorems 6 and 7.

(Insert Table 3 about here) 

The two sets of assumptions about the missing data give similar
estimates of $\mu_i$ and $\sigma_i$ except for $\mu_3$ and $\mu_7$. In
addition, all of the estimates of $\sigma_i$ for the chain-equating
assumptions equal or slightly exceed the corresponding $\sigma_i$ estimates
for the anchor-test equating assumptions. The conclusions about the relative
difficulty of the essay topics differ somewhat across the two sets of
assumptions. For the anchor-test assumptions, topic 3 is the easiest (i.e.,
least severely graded), topics 2 and 5 are the next easiest, topic 6 the next
easiest, and topics 4 and 7 the most difficult (i.e., most severely graded).
For the chain-equating assumptions, topics 2, 3, 5, and 7 are the easiest and
are about equally difficult, topic 6 is more difficult and topic 4 is the
hardest. Both sets of assumptions include topics 4 and 6 among the most
severely graded, but they differ substantially on their assessment of the
difficulty of topic 7. Topic 3 has the highest estimated mean score under
both sets of assumptions, but the anchor-test assumptions assess it as half a
raw-score point easier than the next easiest essay topic. The chain-equating
assumptions assess topic 3 as only slightly easier than the other three easy
topics.

(Insert Tables 4, 5, and 6 about here)

Tables 4, 5 and 6 give the adjusted scores using Livingston's method
and the two methods we have proposed. We used formula (31) for
Livingston's method, and formula (54) with $\mu_i$ and $\sigma_i$ given by
Theorems 6 and 7 for our two methods. None of the three methods makes strong
adjustments to the essay scores, and if rounded to the nearest integer,
Livingston's Ad Hoc procedure and the version of our procedure that uses
chain equating make no adjustment at all to the essay scores. The version of
our procedure that uses anchor-test equating yields adjusted scores in this
example that do not round to the original essay scores in 7 cases. The scores
of 1 through 6 for essay topic 3 are adjusted downwards to scores that round
down one integer. This is in line with the anchor-test equating estimate of
$\mu_3 = 8.0$, making it the easiest of the essay topics. The score 15 for
essay topic 6 is adjusted up to a score that rounds up one integer. One might
have thought that the anchor-test assumptions would have also resulted in
stronger upwards adjustments of the scores for topics 4 and 7. The reason they
do not is that the correlations $\rho_{XY_41}$ and $\rho_{XY_71}$ are smaller
than the others; from $Y_2$ to $Y_7$ these correlations are: .43, .48, .39,
.46, .49, and .37.

Thus, in this example the adjustments are small for all the methods.
This is compatible with the results we found earlier regarding the
acceptability of Livingston's Null Hypothesis for these data.

9. ADJUSTING COMPOSITE SCORES. 

As mentioned earlier, in the applications that we have in mind, such as
the Advanced Placement Examinations, the raw scores for the test are
weighted composites of the score on the mandatory test, X, and the optional
essay, $Y_i$, of the form:

$$S_i = X + w Y_i. \tag{55}$$

The X-score will usually be equated by conventional anchor-test
methods to older forms of the multiple-choice part, so when we refer to X it
is often already converted to an 'X-scale' (although it need not be in
particular applications).

The weight, w, is often of the form

$$w = F \sigma_X / \sigma_Y, \tag{56}$$

so that w is proportional to the ratio of the standard deviation of X to that of
the essay scores where no distinction is made between the essay topics. The
multiplier F can reflect the relative importance given to the essay score
compared to the multiple-choice score X, e.g., F might reflect the amount of
time spent on the essay compared to that spent on the multiple-choice part,
or possibly the ratio of their reliabilities.

One approach to forming the composite score is simply to replace $Y_i$
by the adjusted score $a_i$ from (41) or (54), i.e.,

$$S_{i,\mathrm{adjusted}} = X + w\, a_i. \tag{57}$$

We might call this the 'plug-in' procedure, for obvious reasons. While
probably reasonable in many situations, the plug-in procedure may be
objected to on the grounds that the scores that need to be equated are the raw
scores that give rise to the reported scores, and these are the raw composite
scores, not the raw essay scores. An alternative to adjusting the essay scores
first and then using the adjusted scores in the composite is to adjust the
composite scores directly. In this section we extend the discussion of
section 7 to the case of a composite score.

Usually the weight w can be computed from the data that are on hand
prior to any adjustment procedure; in any event, we will assume w is a
known value. The composite score, $S_i = X + w Y_i$, is very much like the
score, $Y_i$, in that it is only observed for the sub-population of examinees
who write on topic i. We propose an adjusted composite score, $s_i^*$, of the
form:

$$s_i^* = (1 - \rho_{XS_i1})\, s_i + \rho_{XS_i1} \big[ \mu_S + (\sigma_S / \sigma_{S_i})(s_i - \mu_{S_i}) \big], \tag{58}$$

where

$s_i$ is the unadjusted composite score from (55),

$\rho_{XS_i1}$ is the correlation of X and $S_i$ on $P_i$,

$\mu_S$ and $\sigma_S^2$ are the mean and variance of the composite scores
formed without regard to the essay topics involved, and

$\mu_{S_i}$ and $\sigma_{S_i}^2$ are the mean and variance of $S_i$ over the
whole population of examinees formed by making some assumptions about the
missing data, i.e., the values of $S_i$ for the examinees who did not choose
topic i.

As before, we suggest two alternative sets of assumptions that
correspond to chain equating and anchor-test equating, respectively. Here
are the resulting formulas for $\mu_{S_i}$ and $\sigma_{S_i}^2$ for these two
sets of missing data assumptions.

Chain equating case:

$$\mu_{S_i} = E(S_i) = \mu_{S_i1} + (\sigma_{S_i1} / \sigma_{Xi1})(\mu_X - \mu_{Xi1}) \tag{59}$$

and

$$\sigma_{S_i}^2 = \mathrm{Var}(S_i) = (\sigma_{S_i1}^2 / \sigma_{Xi1}^2)\, \sigma_X^2. \tag{60}$$

Anchor-test equating case:

$$\mu_{S_i} = (1 - \rho_{XS_i1})\, \mu_{S_i1} + \rho_{XS_i1} \big[ \mu_{S_i1} + (\sigma_{S_i1} / \sigma_{Xi1})(\mu_X - \mu_{Xi1}) \big] \tag{61}$$

and

$$\sigma_{S_i}^2 = (1 - \rho_{XS_i1}^2)\, \sigma_{S_i1}^2 + \rho_{XS_i1}^2\, (\sigma_{S_i1}^2 / \sigma_{Xi1}^2)\, \sigma_X^2. \tag{62}$$

In (59) to (62), $\mu_{S_i1}$, $\sigma_{S_i1}^2$, and $\rho_{XS_i1}$ denote
the mean, variance and correlation with X of $S_i$ for the examinees who
choose topic i. They replace $\mu_{i1}$, $\sigma_{i1}^2$, and $\rho_{XY_i1}$
in Theorems 6 and 7, respectively.

The adjusted composite score defined in (58) has the feature that the
effect of the equating is dampened by the correlation between X and the
composite score, $\rho_{XS_i1}$. Because $S_i$ contains X, these correlations
will usually be much higher than those between X and $Y_i$, and the overall
effect will be to put most of the weight on the equated composite scores
rather than the unadjusted composite score, $s_i$.
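
The claim that $\rho_{XS_i1}$ is usually much larger than $\rho_{XY_i1}$ can
be illustrated with the subgroup summaries in Table 1. The sketch below uses
only standard identities for the mean, variance and covariance of a linear
combination. The function name is hypothetical, and the multiplier F = 1 in
(56) is an assumption made purely for illustration, not a value taken from
the Advanced Placement program.

```python
import math


def composite_subgroup_moments(mu_i1, sd_i1, mu_xi1, sd_xi1, cov_xy, w):
    """Mean, SD and correlation with X of S_i = X + w*Y_i among topic-i choosers."""
    mu_s = mu_xi1 + w * mu_i1
    var_s = sd_xi1**2 + w**2 * sd_i1**2 + 2 * w * cov_xy
    cov_xs = sd_xi1**2 + w * cov_xy  # Cov(X, X + w*Y_i)
    rho_xs = cov_xs / (sd_xi1 * math.sqrt(var_s))
    return mu_s, math.sqrt(var_s), rho_xs


F = 1.0  # hypothetical multiplier in (56)
w = F * 16.9 / 2.6  # sigma_X / sigma_Y from the "Total" row of Table 1

# Topic 2 row of Table 1: mu_i1, sigma_i1, mu_Xi1, sigma_Xi1, sigma_XYi1.
mu_s, sd_s, rho_xs = composite_subgroup_moments(7.5, 2.5, 51.9, 16.3, 17.6, w)
rho_xy = 17.6 / (2.5 * 16.3)
print(f"rho_XY = {rho_xy:.2f}, rho_XS = {rho_xs:.2f}")
```

Under this assumed weight the correlation for topic 2 rises from about .43
for the essay alone to about .85 for the composite. The resulting subgroup
moments can then be substituted into (59) to (62) in place of the essay
moments.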



10. DISCUSSION AND SUMMARY. 

Our proposals, e.g., formulas (41), (54), and (58), are 'ad hoc' in the
same sense that Livingston's is because there is no real justification for the
simple 'weighted average' form of the adjusted scores or for the choice of
weight, $W_i = \rho_{XY_i1}$. The simple form (41) does have reasonable
properties because it dampens the amount of adjustment made to the essay
scores, and in the face of the test developers' attempts to make the scores on
the optional essays comparable this does seem like a reasonable thing to do.
That (41) is a simple weighted average is also good because it is easy to
understand.

The larger problem is the choice of weight, $W_i$. It seems to us that $W_i$
should reflect the degree of belief of the user in the assumptions made about
the missing data. These assumptions are always untestable when the missing
data are really missing, as they always are in the Optional Essay Problem.
Hence, making any particular assumption about the missing data always
involves some degree of belief unsupported by the data. Setting
$W_i = \rho_{XY_i1}$ has no basis other than the intuition that the greater
$\rho_{XY_i1}$ is the more plausible it is to believe the assumptions
underlying the equating. Livingston mentions that any appropriate increasing
function of $\rho_{XY_i1}$ is a possible candidate for the weight. Our
analyses do not even suggest that $W_i$ ought to be any function of
$\rho_{XY_i1}$, but like Livingston, until there is a better proposal
for $W_i$ we think $W_i = \rho_{XY_i1}$ is a reasonable place to start.

If the correlation between X and $Y_i$ is used as the weight, $W_i$, in
formula (54) it may be useful to consider replacing $\rho_{XY_i1}$ in (52) by
a correlation, $\rho_{XY_i}$, that has been 'corrected' for 'restriction of
range'. This correction can sometimes be substantial. One way to develop a
restriction of range correction for $\rho_{XY_i1}$ is to assume that the
missing data for $Y_i$ are ignorable given X, as is done in anchor-test
equating discussed in Theorem 7. The result of assuming ignorability given X
is the well-known formula (i.e., Pearson, 1903) for the
restriction-of-range adjustment to $\rho_{XY_i1}$, which is, in our notation,

$$\rho_{XY_i} = \rho_{XY_i1}\, (\sigma_X / \sigma_{Xi1}) \big/ \big[ 1 - \rho_{XY_i1}^2 + \rho_{XY_i1}^2\, (\sigma_X^2 / \sigma_{Xi1}^2) \big]^{1/2}. \tag{63}$$

If (63) is used, then to be consistent the anchor-test equating assumptions for
the missing data, i.e., Theorem 7, should be used rather than the chain
equating assumptions of Theorem 6. In our example the corrected
correlations are: .44, .51, .40, .48, .51, and .37. They are very nearly the
same as the uncorrected correlations mentioned at the end of section 8 and
the resulting adjusted essay scores are very nearly the same as those in
Tables 5 and 6 that use $\rho_{XY_i1}$ as the weights.
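
The correction (63) is a one-line computation. In the sketch below the
function name is hypothetical, and the inputs are the rounded correlations
from section 8 together with the $\sigma_{Xi1}$ column and overall
$\sigma_X$ from Table 1.

```python
import math


def corrected_correlation(rho_1, sd_x, sd_x1):
    """Restriction-of-range correction (63) under ignorability given X."""
    ratio = sd_x / sd_x1
    return rho_1 * ratio / math.sqrt(1 - rho_1**2 + rho_1**2 * ratio**2)


SIGMA_X = 16.9
# topic -> (rounded rho_XYi1 from section 8, sigma_Xi1 from Table 1)
INPUTS = {
    2: (0.43, 16.3),
    3: (0.48, 15.8),
    4: (0.39, 16.6),
    5: (0.46, 15.9),
    6: (0.49, 16.1),
    7: (0.37, 16.9),
}

corrected = {t: corrected_correlation(r, SIGMA_X, s) for t, (r, s) in INPUTS.items()}
print({t: round(v, 2) for t, v in corrected.items()})
```

With these rounded inputs the results round to .44, .51, .40, .48, .51 and
.37. The correction vanishes for topic 7 because its subgroup standard
deviation on X equals the overall one, and it grows as the subgroup becomes
more homogeneous on X than the full population.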
In summary, we have introduced the Optional Essay Problem as a
useful paradigm example of examinee choice, and considered the problem of
equating the optional essays from two points of view. First, we looked at
how we might marshal evidence that the essays don't need to be equated at
all (sections 1 to 6). Second, we examined Livingston's proposal for
adjusting the essay scores and put it into the context of ordinary test equating
(section 7). This has two benefits. First, we can see how to develop several
new alternative methods of essay score adjustment that each make different
assumptions about the missing data (sections 7 and 8). Second, all of these
methods easily generalize to the problem of equating the composite scores of
which the optional essays are a part (section 9).

Table 1

Data from the 1987 Advanced Placement European History Exam

Subgroup      Optional Essay                 Multiple Choice
Selecting
Essay         $\mu_{i1}$   $\sigma_{i1}$     $\mu_{Xi1}$   $\sigma_{Xi1}$    $\sigma_{XY_i1}$   $p_i$

2             7.5          2.5               51.9          16.3              17.6               .32
3             8.4          2.3               57.7          15.8              17.4               .13
4             6.5          2.6               53.3          16.6              16.7               .16
5             7.4          2.5               52.0          15.9              18.1               .10
6             6.7          2.4               51.7          16.1              19.0               .11
7             5.9          2.5               42.7          16.9              15.8               .18

Total         7.1          2.6               51.3          16.9              20.2               1.00






Table 2

Data from a hypothetical example showing a reversal, Livingston (1988)

Subgroup      Optional Essay                 Multiple Choice
Selecting
Essay         $\mu_{i1}$   $\sigma_{i1}$     $\mu_{Xi1}$   $\sigma_{Xi1}$    $p_i$

2             6.4          2.9               45.2          16.7              .06
3             6.9          2.4               47.7          15.6              .65
4             6.6          2.8               47.9          18.1              .02
5             5.9          2.4               48.8          15.6              .18
6             7.1          2.8               41.5          17.4              .09






Table 3

Estimates of $\mu_i$ and $\sigma_i$ using two sets of missing data assumptions.

Subgroup      Anchor-test equating        Chain-equating
Selecting     assumptions                 assumptions
Essay         $\mu_i$     $\sigma_i$      $\mu_i$     $\sigma_i$

2             7.5         2.5             7.4         2.6
3             8.0         2.3             7.5         2.5
4             6.4         2.6             6.2         2.7
5             7.4         2.5             7.3         2.7
6             6.7         2.4             6.6         2.5
7             6.4         2.5             7.2         2.5



Table 4

Adjusted Essay Scores Using Livingston's 'Ad Hoc' Procedure

                         Essay Topics
Score      2       3       4       5       6       7

 1        0.9     0.7     1.4     1.0     1.1     0.9
 2        1.9     1.7     2.4     2.0     2.1     1.9
 3        2.9     2.7     3.4     3.0     3.1     2.9
 4        3.9     3.8     4.4     3.9     4.2     3.9
 5        4.9     4.8     5.4     4.9     5.2     4.9
 6        5.9     5.8     6.4     5.9     6.2     5.9
 7        6.8     6.8     7.4     6.9     7.2     6.9
 8        7.8     7.8     8.3     7.9     8.2     8.0
 9        8.8     8.8     9.3     8.9     9.2     9.0
10        9.8     9.9    10.3     9.8    10.2    10.0
11       10.8    10.9    11.3    10.8    11.2    11.0
12       11.8    11.9    12.3    11.8    12.2    12.0
13       12.8    12.9    13.3    12.8    13.2    13.0
14       13.8    13.9    14.3    13.8    14.2    14.0
15       14.8    14.9    15.2    14.8    15.2    15.0

Table 5

Adjusted Essay Scores Using (54) and Anchor-test Equating

                         Essay Topics
Score      2       3       4       5       6       7

 1        0.7     0.1     1.3     0.8     1.0     1.2
 2        1.7     1.2     2.3     1.8     2.0     2.2
 3        2.8     2.3     3.3     2.8     3.1     3.2
 4        3.8     3.3     4.3     3.8     4.1     4.2
 5        4.8     4.4     5.3     4.8     5.1     5.2
 6        5.8     5.4     6.3     5.8     6.2     6.3
 7        6.8     6.5     7.3     6.9     7.2     7.3
 8        7.8     7.6     8.3     7.9     8.3     8.3
 9        8.9     8.6     9.3     8.9     9.3     9.3
10        9.9     9.7    10.3     9.9    10.3    10.3
11       10.9    10.8    11.3    10.9    11.4    11.3
12       11.9    11.8    12.3    12.0    12.4    12.4
13       12.9    12.9    13.3    13.0    13.5    13.4
14       13.9    13.9    14.3    14.0    14.5    14.4
15       15.0    15.0    15.3    15.0    15.5    15.4

Table 6

Adjusted Essay Scores Using (54) and Chain Equating

                         Essay Topics
Score      2       3       4       5       6       7

 1        0.9     0.7     1.4     1.0     1.1     0.9
 2        1.9     1.7     2.4     2.0     2.2     1.9
 3        2.9     2.7     3.4     3.0     3.2     2.9
 4        3.9     3.7     4.4     4.0     4.2     3.9
 5        4.9     4.8     5.4     5.0     5.2     4.9
 6        5.9     5.8     6.4     5.9     6.2     5.9
 7        6.9     6.8     7.3     6.9     7.3     7.0
 8        7.9     7.8     8.3     7.9     8.3     8.0
 9        8.9     8.8     9.3     8.9     9.3     9.0
10        9.9     9.9    10.3     9.9    10.3    10.0
11       10.9    10.9    11.3    10.9    11.3    11.0
12       11.9    11.9    12.3    11.8    12.4    12.0
13       12.9    12.9    13.3    12.8    13.4    13.1
14       13.9    13.9    14.2    13.8    14.4    14.1
15       14.9    15.0    15.2    14.8    15.4    15.1

Figure 1:

Graph of (5) & (6) for AL = 0.90 & AU = 1.10

Figure 2:

Graph of (5) & (6) for AL = 0.90 & AU = 1.10

Figure 3:

Graph of (5) & (6) for AL = 0.95 & AU = 1.05

Figure 5:

Graph of (21) & (22) for AL=BL=0.95, AU=BU=1.05

(Figure images not reproduced; the only surviving axis label is 'Covariance'.)